Abstract. Dedicated systems are fundamental for neuroscience experimental
protocols that require timing determinism and synchronous stimuli
generation. We developed a data acquisition and stimuli generator
system for neuroscience research, optimized for recording timestamps
from up to 6 spiking neurons and entirely specified in a high-level
Hardware Description Language (HDL). Despite the logic complexity penalty
of synthesizing from such a language, it was possible to implement our
design in a low-cost small reconfigurable device. Under a modular
framework, we explored two different memory arbitration schemes for our
system, evaluating both their logic element usage and resilience to
input activity bursts. One of them was designed with a decoupled and
latency insensitive approach, allowing for easier code reuse, while the
other adopted a centralized scheme, constructed specifically for our
application. The usage of a high-level HDL allowed straightforward and
stepwise code modifications to transform one architecture into the other.
The achieved modularity is very useful for rapidly prototyping novel
electronic instrumentation systems tailored to scientific research.

Keywords: Spiking Neurons, Data Acquisition, Precise Timing, Resource
Arbitration, Latency Insensitive, Modular Design.


1 Introduction 

Neurons usually behave by emitting stereotyped pulses of electric
depolarization through their membranes, creating temporally localized
spikes. It is a common belief that spiking neurons follow an all-or-none
principle, similar to the processing of digital signals, by encoding
information only through spike timing [1]. Although each individual cell
always produces the same waveform, the most widespread experimental
approach employs Analog-to-Digital Converters (ADCs) integrated on
commercial acquisition systems to capture complete waveforms. This
procedure is required when the researcher wants to analyze a large
population of neurons while recording from only a few electrodes, then
applying a neuron classification technique known as spike sorting to
discriminate individual waveforms [2]. However, because of the lack of
readily available specialized acquisition hardware, many works adopt the
same recording technique even though they only need to identify the
occurrence of spikes from one neuron per electrode [3,4,5,6]. The
resulting data files are large and spikes need to be detected by
software, demanding a considerable amount of time.

In this paper we present the design of a low-cost alternative hardware
solution based on a dedicated Complex Programmable Logic Device (CPLD). We
have chosen CPLDs instead of Field Programmable Gate Arrays (FPGAs) to
demonstrate the flexibility of our approach, as CPLDs are usually limited
to a small number of logic gates, and lack common FPGA features such as
Block RAMs and Phase Locked Loops (PLLs). We implemented the logic
circuits on the CPLD adopting a modular design, which aims to facilitate
future refinement and customization for specific applications. The
complete source code, implemented in the Bluespec SystemVerilog (BSV) [7]
language, is available at [8].

BSV designs targeted at small reconfigurable devices, such as ours, are
rare in the literature, since many works show that BSV usually produces a
higher logic element (LE) count than Register-Transfer Level (RTL)
languages [9,10,11]. However, some research [12] argues that
microarchitectural choices have greater impact on the LE usage than the
specification’s abstraction level, although there is a lack of studies in
glue-logic-sized architectures with significant modularity and
complexity. This paper showcases such a system, and also explores the
impact of latency insensitive module decoupling [13] by comparing two
distinct implementations of a resource arbitration scheme. Similar work
evaluating synthesis results exists [14], but we also test the
consequences on system resilience to extreme conditions, many times above
our application requirements.

The acquisition input is provided to our digital logic by an analog
front-end system which generates an asynchronous TTL-compatible signal
pulse at the occurrence of each valid spike. Our entire circuit was
designed to be compatible with and easily inserted into a previous
experimental setup [15] devised for studying neural codification in
Chrysomya megacephala's visual system, but it is sufficiently generic to
be suitable for a wide range of neuroscience experiments.


Main contributions of this work:

- Develops a portable, low-cost and precise data acquisition system for
neuroscience and neuroethology experiments.

- Applies the seldom used concept of recording digital events (instead of
ADC-converted data) to increase the precision of neural spike timing.

- Employs the BSV language in a small and resource constrained system.

- Showcases architecture refactoring from a decoupled to a centralized
scheme.


Paper organization: The next section describes the basic specifications of
our design and its overall architecture. Section 3 discusses the system
implementation, focusing on points common to both the dynamic and static
arbiter versions. Sections 3.1 and 3.2 delve into specific aspects of each
one of the implementations. Section 4 presents synthesis, experimental and
simulation results. Finally, we conclude in Section 5.

2 Overall system architecture 

Our system offers 6 TTL-level pulse timestamp acquisition inputs, 4 analog
16-bit resolution outputs for stimuli generation and a Joint Test Action
Group (JTAG) host computer interface. It is composed of a MAX II Micro Kit
(EPM2210F324C3 CPLD), a 74HC4050 buffer for input overvoltage protection,
a MAX5134 Digital-to-Analog Converter (DAC) and an IDT71256 20 ns
32Kx8-bit SRAM. We have divided the project into the following functional
subunits:

Synchronizer: Receives asynchronous input pulses and registers 32-bit
timestamps from a hardware counter, each one paired with a flag indicating
which input channels fired since the last count. In most neural systems,
1 μs is believed to be enough resolution for studying fine details of
information coding [16].

FIFO SRAM: Provides an interface for using the external SRAM memory 
as a pair of First-In First-Out (FIFO) queues of 16 KiB each. One of them 
buffers data acquired from inputs, and the other buffers stimuli received from a 
computer. Our FIFO modules are compatible with the BSV standard library. 

JTAG interface: Provides communication with a host computer. We have
wrapped Altera JTAG-UART libraries into a ready-to-use BSV module. By
using this protocol, the same communication module is portable to any CPLD
or FPGA manufactured by the same vendor. As programmable devices are
configured via JTAG, the bus is readily available through USB adapters
embedded in almost every evaluation board. However, this approach
introduces a significant protocol overhead by encapsulating UART emulation
inside JTAG-USB, limiting the data rate to about 1 Mbit/s. Also, client
software needs to explicitly poll the device, because the interface is
neither interrupt-driven nor event-driven. As a result, software
determinism may become a bottleneck, depending on the hardware buffer size
and the desired data rate. Nevertheless, these limitations do not impair
this particular application.


3 BSV module architecture 

Bluespec SystemVerilog is a strongly typed high-level hardware description
language (HDL) with functional paradigm features. A BSV design is
organized in modules and rules. Modules provide interfaces, composed of a
set of methods which can be used to access or modify their internal state.
State changes (side effects) are clearly separated from read-only
operations by means of a monad [17] called Action, thus any expression
which modifies state has an action type. Modules can be statically
elaborated several times, allowing complex circuit structures to be
represented. Rules are entities which describe the connections between
modules and ultimately define hardware dynamics. They are formed by a set
of actions and a boolean predicate, which defines an explicit condition
needed to allow execution of the actions (rule firing). During a single
clock cycle, a rule is guaranteed to entirely complete its execution or
not to fire at all, a property known as transaction atomicity. Rule firing
can also be affected by implicit conditions, which can be attributed to
any BSV expression. The BSV compiler propagates an implicit condition back
to the predicate of the rule which actually executes the action or queries
the value of the corresponding expression. Implicit conditions are usually
attributed to method boundaries, serving as an effective way to specify
module contracts. When synthesizing, the compiler defines an execution
order for rules, allowing a hardware scheduler to be generated according
to the Term Rewriting Systems (TRS) formalism [15].



Fig. 1. Block diagram of the BSV modules and rules. The gray shaded areas
represent internal logic, while white structures depict I/O interfaces.
Modules are portrayed as quadrilaterals. The names inside them are module
names, and those outside are instance names. Small rectangles on their
sides are interfaces. Ellipses designate rules. Those which only perform
connections are omitted and represented directly by arrows.


Figure 1 illustrates our main module. The fundamental difference between
the arbitration approaches lies in the SRAMFIFO internals and in how
communication with the SRAM occurs. In the dynamic approach this is done
by a client-server interface mediated by FIFOs, whereas in the static one
an extra central module exists which assigns a specific operation to the
SRAMFIFOs on each cycle.

Acquisition data flow: Asynchronous pulses arriving from acquisition
inputs are synchronized to the system clock by the AsyncPulseSync modules,
resulting in the syncedin signal. The blendChannelFlags rule accumulates
one bit for each input channel in the channelFlags register, if a pulse
occurred on syncedin since the last collected timestamp. The
timestampUpdate rule atomically increments the timestamp register, sends
the channelFlags value and the current timestamp to the funnel, and resets
channelFlags to zero. The funnel emits one byte of its input per cycle to
the uartOutFifo. Finally, data coming from dataOutFifo can be read in the
host computer after being collected by the JTAG-UART transmitting (tx)
interface.
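
The acquisition path just described can be mimicked in software to make
the data flow concrete. The Python sketch below is only a behavioral
illustration and not the BSV implementation. Several details are
assumptions made for the example: the record layout (one flags byte
followed by a big-endian 32-bit timestamp), the update period of 50
cycles, the choice of emitting a record only when at least one channel
fired, and the names AcquisitionModel and decode_records.

```python
from collections import deque

UPDATE_PERIOD = 50  # assumed: 50 MHz clock / 50 = one timestamp tick per microsecond


class AcquisitionModel:
    """Toy cycle-level model of blendChannelFlags, timestampUpdate and the funnel."""

    def __init__(self):
        self.cycle = 0
        self.timestamp = 0       # 32-bit timestamp register
        self.channel_flags = 0   # one bit per input channel (6 channels)
        self.funnel = deque()    # bytes still to be emitted, one per cycle
        self.out_fifo = []       # stands in for uartOutFifo

    def step(self, synced_in):
        """Advance one clock cycle; synced_in is a 6-bit mask of synchronized pulses."""
        # blendChannelFlags: accumulate pulses seen since the last collected timestamp
        self.channel_flags |= synced_in & 0x3F
        self.cycle += 1
        if self.cycle % UPDATE_PERIOD == 0:
            # timestampUpdate: send (flags, timestamp), increment, clear flags.
            # Assumption: a record is only sent when at least one channel fired.
            if self.channel_flags:
                record = bytes([self.channel_flags]) + self.timestamp.to_bytes(4, "big")
                self.funnel.extend(record)
            self.timestamp = (self.timestamp + 1) & 0xFFFFFFFF
            self.channel_flags = 0
        # funnel: emits one byte of its input per cycle
        if self.funnel:
            self.out_fifo.append(self.funnel.popleft())


def decode_records(stream):
    """Split a byte stream into (channel_flags, timestamp) tuples of 5 bytes each."""
    return [
        (stream[i], int.from_bytes(stream[i + 1:i + 5], "big"))
        for i in range(0, len(stream) - 4, 5)
    ]
```

Feeding the model a pulse on channel 0 during the first tick and a pulse
on channel 2 during the second one yields two 5-byte records, which
decode_records turns back into (flags, timestamp) pairs on the host side.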

Stimulus generation data flow: Begins at the JTAG-UART receiving (rx)
interface. The uartHandleCmd rule identifies if the byte received from the
computer represents a start command or a DAC conversion request. A start
command sets a boolean register (omitted from the figure) which unblocks
the predicate of the dacLoad, timestampUpdate and blendChannelFlags rules.
A DAC conversion request sends the current byte and the next two bytes to
uartInFifo. After coming out of the FIFO, the bytes feed the unfunnel
block, which merges three bytes together. The dacHandleReq rule controls
the request flow to the DAC module. The dacLoad rule fires when the first
input channel receives a pulse, unblocking the dacHandleReq rule and
causing a synchronous update on all DAC outputs. This input channel is
used to synchronize analog outputs to the desired stimuli clock, e.g. the
display controller in a visual stimulation system.

BSV has an implicit condition mechanism which eases the specification of a
provably correct system. We only needed to add error handling to four
places of our design. The first one is related to tx path FIFO overflow
and is put in the timestampUpdate rule, ensuring that the timestamp is
always incremented at each update period. The second check is accomplished
in the dacLoad rule, and verifies if a FIFO underrun has occurred in the
rx path, by checking that the DAC is ready to receive a new command and
that all DAC registers were filled since the last load. The final two are
not directly related to system functional correctness, but to good
debugging practices. The third one checks if bytes received from JTAG-UART
correspond to valid commands. The last one verifies if DAC requests are
still valid after leaving uartInFifo, aiming to detect any occurrences of
data corruption during communication with the SRAM chip. When any of these
error conditions occurs, we alert the user by blinking LEDs until the
system is reset.

Next, we describe characteristics of the common system sub-modules. 

SER/DES: Serializer and deserializer modules are implemented using shift
registers. Our design is generic and type parametrized, making it reusable
with any input or output data types. A code excerpt illustrating these
concepts is shown in Figure 2.

DAC: In order to rapidly prototype the control of a Serial Peripheral
Interface (SPI) and DAC linearity calibration procedures, we employed a
standard BSV library called StmtFSM, which consists of a Domain Specific
Language (DSL) for specifying Finite State Machines (FSMs). The FSMs could
be easily composed and exposed in the form of a simple external module
interface. DAC register update requests supported by the MAX5134 DAC
contain three bytes, one specifying the target channels and two bytes of
data. Part of the FSM implementation is illustrated in Figure 3.

SRAMFIFO: This module is also type parametrized. It exposes a fifo
subinterface mimicking the BSV standard library’s mkLFIFO, and a cli
subinterface which can be connected to a SRAM or SRAMSplit server
interface. Usage of Ephemeral History Registers (EHRs) [19] greatly
simplified the design in order to achieve the same scheduling
specifications as mkLFIFO. EHRs provide a register-like interface on which
same-cycle accesses can be ordered according to the logical execution
order of rules or methods. Head and tail position pointers to locations
inside the SRAM are held in EHRs. The SRAMFIFO stores one unit of data (in
the case of this design, one byte) in a local cache FIFO implemented using
flip-flop registers, whose output is connected directly to the SRAMFIFO’s
own output. When the cache FIFO becomes empty, we dispatch a read request
to the SRAM, aiming to keep the cache filled most of the time. When new
data is enqueued to the SRAMFIFO and no space is available in the cache
FIFO, a write request is sent to the SRAM.


    // Defines an interface for generic types a and b
    interface Funnel#(type a, type b);
        // ...
    endinterface

    function Funnel#(a,b)
            toFunnel(FIFOF#(a) infifo, FIFOF#(b) outfifo);
        return (interface Funnel;
            method Bool notFull = infifo.notFull;
            // ...
        endinterface);
    endfunction

    module mkFunnel(Funnel#(a,b))
        // ...
        rule firstCycle(stage == 0);
            // Shifts the input value from infifo
            shiftReg <= truncateLSB(inval << nB);
            outfifo.enq(unpack(truncateLSB(inval)));
            updateStage(); infifo.deq();
        endrule
        rule(stage != 0);
            outfifo.enq(unpack(truncateLSB(shiftReg)));
            shiftReg <= shiftReg << nB;
            updateStage();
        endrule
        // Returns an in/out interface from the infifo/outfifo
        return toFunnel(infifo, outfifo);
    endmodule


Fig. 2. Code excerpt from the serializer shift register implementation,
illustrating parametrized data types and abstract manipulation of
interfaces by compile-time resolved functions (toFunnel).


    Stmt shiftRegSenderStmt = seq
        while(!shiftRegDone) seq
            // Send each bit of shiftReg to DAC ...
        endseq;
        rnCS[0] <= 1; delay(2);
    endseq;

    // instantiates the state machine from specified instructions
    FSM shiftRegSender <- mkFSM(shiftRegSenderStmt);

    function Action send(Bit#(24) cmd);
        action
            shiftReg <= {cmd, 1'b1}; rnCS[1] <= 0;
            shiftRegSender.start;
        endaction
    endfunction

    // Calibrate DAC
    Stmt dacCalibrationStmt = seq
        delay(waitVoltageStab); send(commandBits_1);
        shiftRegSender.waitTillDone;
        delay(dacCalibCycles); send(commandBits_2);
    endseq;
    // ...

    interface Put req;
        method Action put(DACReq r) if (rdy);
            match {.mask, .sample} = r;
            // Send command for writing sample
            send({4'b0001, mask, sample});
        endmethod
    endinterface


Fig. 3. Code illustrating usage of the Finite State Machine (FSM) Domain
Specific Language (DSL). FSMs can be specified as a series of statements
(Stmt) in a DSL which resembles an imperative software language. The mkFSM
module transforms these statements into a hardware implementation. We have
implemented a send function which is used both to compose different FSMs
together and to start an FSM from an externally accessible method.


3.1 Dynamic arbiter 

In this design, the SRAM controller (Figure 4) is decoupled from the
SRAMSplit module (Figure 5). The former dispatches requests in the order
in which they are received in reqfifo, using an internal cycle register to
keep track of its state during a single request. The latter arbitrates the
access of two other modules to a single SRAM controller. Requests received
from both modules are placed into a pair of FIFOs. A set of mutually
exclusive rules then controls the priority of each request. Two of them
are generated by the getPrioritizeValid function, one of which is fired
when a single request FIFO is not empty. However, when both FIFOs contain
data, the prioritize_current_turn rule is fired, prioritizing the Least
Recently Used (LRU) FIFO. The turn register holds the state needed to
infer the LRU FIFO.
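
The arbitration policy described above can be illustrated with a small
behavioral model. The Python sketch below is not the BSV code of
SRAMSplit. It assumes that the turn register names the client to be
favored when both request FIFOs hold data, and that it is updated after
every granted request. The class name SplitArbiterModel and its methods
are hypothetical.

```python
from collections import deque


class SplitArbiterModel:
    """Toy model of two request FIFOs sharing a single SRAM controller."""

    def __init__(self):
        self.req = (deque(), deque())  # one request FIFO per client
        self.turn = 0                  # client favored when both FIFOs hold requests
        self.pending = deque()         # which client is owed each response, in order

    def put(self, client, request):
        """Enqueue a request coming from client 0 or client 1."""
        self.req[client].append(request)

    def arbitrate(self):
        """Forward at most one request; return (client, request) or None."""
        has0, has1 = bool(self.req[0]), bool(self.req[1])
        if not (has0 or has1):
            return None
        if has0 and has1:
            chosen = self.turn         # both valid: serve the least recently used
        else:
            chosen = 0 if has0 else 1  # a single valid FIFO is served directly
        self.turn = 1 - chosen         # the other client becomes least recently used
        self.pending.append(chosen)    # lets responses be routed back in order
        return chosen, self.req[chosen].popleft()
```

With two requests queued by each client, successive calls to arbitrate
alternate between the clients, and the pending queue records the order in
which responses must be handed back.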



Fig. 4. In the dynamic arbiter based design, the SRAM controller is
decoupled from the SRAMSplit module. Its state over the two cycles of
operation is determined by an internal cycle register, handled by the
cycle_machine rule. The fifo_to_wires rule controls the tri-state buffer
and drives the outputs.


Arbitration also needs to take place in the SRAMFIFO module, because
methods for enqueuing and dequeuing data are designed not to conflict, in
order to simplify module reuse. When both methods are called during the
same cycle, we check if the queue’s head and tail pointers are equal to
each other. This means that the dequeue method has requested to read the
same address which the
Fig. 5. SRAMSplit provides two SRAM server interfaces and arbitrates their
access to a single SRAM controller. Requests coming to both servers are
held in FIFOs until being sent to the controller. A set of three mutually
exclusive rules arbitrates the access based on not-empty FIFO flags and on
a turn register. A pending FIFO preserves the requester identification,
allowing responses to be served back in the right order.


    // Left panel
    interface Client cli;
        interface Get request;
            method ActionValue#(SRAMReq#(Bit#(addr_sz), tdata)) get;
                // Obey the scheduled request order
                MemOp turn; // Read or Write enum
                case (req_turn.first) matches
                    Read: action turn = Read; req_turn.deq; endaction
                    ReadThenWrite: action
                        turn = turn_stage == 1'b0 ? Read : Write;
                        if(turn_stage == 1'b1) req_turn.deq;
                        turn_stage <= ~turn_stage;
                    endaction
                    // Write and WriteThenRead implementations here ...
                endcase
                last_op[1] <= turn;
                // Forward request to SRAMSplit
                if (turn == Read) begin
                    req_read.deq; return tagged Read req_read.first;
                end else /* Write request here ... */
            endmethod
        endinterface
        interface Put response; /* ... */ endinterface
    endinterface

    // Right panel
    rule compute_req_turn(deq_requested_mem || enq_requested_mem);
        if(deq_requested_mem && enq_requested_mem) begin
            if(head[2] == tail[2]) begin
                // conflict: deq has read from head[2]-1 and enq has written to tail[2]-1
                // enforce memory order compatible with method order:
                // deq < enq implies read < write
                req_turn.enq(ReadThenWrite);
            end else begin
                req_turn.enq(last_op[0] == Read ? WriteThenRead : ReadThenWrite); // LRU
            end
        end else if(deq_requested_mem && !enq_requested_mem) begin
            req_turn.enq(Read);
        end else if(!deq_requested_mem && enq_requested_mem) begin
            req_turn.enq(Write);
        end
    endrule



Fig. 6. Excerpts from the dynamic arbitrated SRAMFIFO implementation. All
of our modules follow a client/server pattern, demanding the treatment of
simultaneous requests for full decoupling. The compute_req_turn rule
(right panel) chooses the turn according to which requests were issued at
the current cycle. Whenever possible, a Least Recently Used (LRU) scheme
is adopted (highlighted in the code). The request.get method of the client
interface (left panel) queries this information from the req_turn FIFO,
updates last_op and forwards the correct request by returning it. The
last_op register is an example of an Ephemeral History Register (EHR):
references to it must be appended with an index (shown as subscript text
in the code) which defines the logical execution order of register read
and write operations.






enqueue method asked to write. In this case, we force requests to be sent
to the SRAM in the same order as the logical execution order chosen when
designing the methods (dequeue before enqueue, and thus read before
write). Otherwise, we follow an LRU approach based on the value of a
register which holds the type of the last memory operation issued by the
SRAMFIFO module (Figure 6).


3.2 Static arbiter 

Starting from the code of the dynamic version, we incrementally added new 
conditions to method predicates, testing the system after the changes. As implicit 
conditions which control the data flow of FIFOs are still present in the logic at 
this development stage, designer errors tend to prevent rules from firing, stopping 
data flow and making the system hang instead of producing incorrect results. 
Fig. 7. Timing diagram of SRAMFIFO transactions governed by the static
arbiter. Arrows identify in which cycles the memory requests and responses
occur. White rectangles represent actions on uartInFifo, while black
rectangles depict operations on uartOutFifo.


We added these predicate conditions based on a manually devised
arbitration schedule, shown in Figure 7. This schedule allows for the
execution of an enqueue and a dequeue operation on both uartInFifo and
uartOutFifo during the course of 8 clock cycles. A central arbiter, which
consists of a counter reset every 8 cycles, was implemented just below the
top level module. Boolean values derived from this counter, signaling if
each operation could occur during each cycle, were routed from the top
level module to the inner SRAM controller, SRAMSplit and SRAMFIFOs
(Figure 8). After the predicates were changed, some FIFOs could be
removed, reducing the number of LEs needed to implement the design.

SRAM controller: The cycle register (compare with Figure 4) was removed
and replaced by the least significant bit of the central arbiter counter.
Memory requests became allowed only when this bit is zero, which happens
in cycles numbered 0, 2, 4 and 6 (Figure 7).

SRAMSplit: Requests to one of the arbitrated servers became allowed only
during the correct cycle, as defined by the timing diagram: requests
coming from uartInFifo at cycles numbered 0 and 2, and those from
uartOutFifo at cycles 4 and 6. Both reqfifos could then be removed from
the design without affecting its behavior. The order of responses could
also be inferred from the diagram (cycles 3, 5, 7 and 1), allowing us to
remove the pending FIFO. After all changes, the module became just an
abstraction which synthesizes purely to wires (compare with Figure 5).
    // Central Arbiter definition
    // Just a cyclic counter
    interface CentralArbiter#(numeric type arbitrated_units);
        interface ReadOnly#(Bit#(TLog#(arbitrated_units))) turn;
    endinterface

    module mkCentralArbiter(CentralArbiter#(n));
        Reg#(Bit#(TLog#(n))) turnCounter <- mkRegU;
        rule incrementTurn;
            let maxCount = fromInteger(valueOf(n) - 1);
            turnCounter <= (turnCounter == maxCount) ? 0 : (turnCounter + 1);
        endrule
        interface ReadOnly turn = regToReadOnly(turnCounter);
    endmodule

    // On the top module, we route it to the modules
    CentralArbiter#(8) arb <- mkCentralArbiter;
    SRAMSplit#(AddrSize, Byte) sram
        <- mkSRAMSplit(arb.turn[2], arb.turn[0], arb.turn == 5, arb.turn == 1);
    SRAMFIFO#(FifoAddrSize, Byte) uartInFifo
        <- mkSRAMFIFO(arb.turn == 2, arb.turn == 0);
    SRAMFIFO#(FifoAddrSize, Byte) uartOutFifo
        <- mkSRAMFIFO(arb.turn == 4, arb.turn == 6);

    // ... cycle by inserting predicate conditions
    module mkSRAM#(Bit#(1) turn) (SRAM#(taddr, tdata))
        // ...
        rule cycle_machine(/* ... */); // Notice the cycle dependent predicate
            /* ... */
        endrule
        interface Server ifc; interface Put request;
            method Action put(SRAMReq#(taddr, tdata) req)
                    if (turn == 0); // Method implicit condition
                // ...
            endmethod
        endinterface; /* ... */ endinterface;
    endmodule

    // We also attributed implicit conditions to specific
    // actions which depend on the correct cycle
    module mkSRAMFIFO#(Bool turnRead, Bool turnWrite)
            (SRAMFIFO#(addr_sz, tdata))
        // ...
        method Action enq(tdata x) if (not_ring_full[1]);
            // ...
            // when adds implicit conditions to an action expression
            when(turnWrite, action
                // ...
            endaction);
        endmethod
    endmodule

Fig. 8. Excerpts from the static arbiter implementation. The
CentralArbiter module, which consists of a simple cyclic counter, is
instantiated inside the top-level module. Its turn method describes the
current cycle according to the schedule of Figure 7. Some bits of the
cycle counter, or Boolean conditions involving its current state, are
routed to inner modules, where they are appended to predicates or added as
implicit conditions.


SRAMFIFO: Dequeues and enqueues became allowed only during the designated
cycles (2 and 4 for dequeues, 0 and 6 for enqueues). This allowed us to
remove the memory request output FIFOs, which were replaced by wires.
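
The fixed schedule can be written down as a lookup table, which makes its
main consequence easy to check: each FIFO gets exactly one enqueue and one
dequeue slot every 8 cycles. The Python sketch below only restates the
cycle numbers given in the text. The names REQUEST_SLOTS, central_arbiter
and slot_for are hypothetical, and starting the counter at zero is an
assumption of the example.

```python
SCHEDULE_LENGTH = 8

# Memory request slot granted on each even cycle, as stated in the text
REQUEST_SLOTS = {
    0: ("uartInFifo", "enq"),
    2: ("uartInFifo", "deq"),
    4: ("uartOutFifo", "deq"),
    6: ("uartOutFifo", "enq"),
}


def central_arbiter(n_cycles):
    """Yield the turn value of a cyclic counter for n_cycles clock cycles."""
    turn = 0  # assumption: the model starts counting from zero
    for _ in range(n_cycles):
        yield turn
        turn = 0 if turn == SCHEDULE_LENGTH - 1 else turn + 1


def slot_for(turn):
    """Return the (fifo, operation) allowed at this turn, or None on odd cycles."""
    if turn & 1:
        return None  # odd cycles carry the responses of the previous request
    return REQUEST_SLOTS[turn]
```

Iterating over central_arbiter and calling slot_for on each turn shows
that a given FIFO can enqueue at most one byte per 8 cycles under this
schedule, regardless of how idle the other FIFO is.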


4 Results 

4.1 Synthesis 

Synthesis results are shown in Table 1. On the device actually adopted in
our project (EPM2210F324C3), the dynamic arbitrated circuit occupies 200
more LEs than the static arbiter design. This corresponds to 9% of the LEs
available in the CPLD. Almost half of the hardware resources are still
free and could be exploited to implement new features. We have also
synthesized both architectures on a smaller device (EPM1270F256C3) in
order to demonstrate that the design can meet the requirements even when
reaching the limits of the CPLD substrate. The Quartus II Fitter clearly
undertook more effort during synthesis on this device, as both circuits
were implemented occupying fewer LEs than on the bigger CPLD. Nonetheless,
the attainable clock frequency was not significantly reduced by this area
optimization, remaining above 50 MHz.


4.2 Experimental validation 

Workbench validation consisted in connecting independent square-wave periodic 
signal generators into each input of the system for 8 hours and then analyzing 
Table 1. Synthesis results for both arbiter designs (Altera Quartus II 14.0)

                  EPM2210F324C3                  EPM1270F256C3
Arbiter design    Logic elements  Max. clock     Logic elements  Max. clock
                                  frequency                      frequency
Static            1017 (46%)      54.57 MHz      970 (76%)       54.36 MHz
Dynamic           1217 (55%)      54.07 MHz      1168 (92%)      53.49 MHz



Fig. 9. Histograms displaying the number of occurrences of each time
interval Δt (in μs) measured between two consecutive pulses. Each one of
the six input channels was connected to an independent periodic signal
source. Neither missed nor spurious events could be observed even after
eight hours of acquisition.


the acquired data to look for spurious or missing detections. Figure 9
shows histograms of the time interval (Δt) between two consecutive
recorded pulses for all input channels. Histograms shown in the first row
correspond to periodic pulses generated by a 1-channel Hewlett-Packard
33120A and a 2-channel Sony-Tektronix AFG320 function generator. The three
remaining channels were fed with signals generated by free-running astable
oscillators made using the NE555 timer integrated circuit.

The first input channel was programmed to synchronize DAC conversions, 
thus we have fed it with 500 Hz (frame rate frequency adopted by the VSImG [TS] 
visual stimulation system). The second and third channels were supplied with 
close but incommensurable frequencies (6.2 kHz and 6.1 kHz). The remaining 
channels were fed by similar frequencies, produced by three identical NE555 
circuits, differing only within component nominal tolerances. 
Experiments with both architectures (dynamic and static arbiter) resulted
in almost identical histograms, thus the figure only portrays the results
for one of them (dynamic arbiter). The histograms show that during 8 hours
of acquisition no pulses were missed, and no spurious events were
registered, otherwise the abscissa of the graph would reach double or half
the value of the baseline period, respectively. The maximum deviation from
the adjusted periods was within the generators’ acceptable thermal drift.
As expected, NE555 oscillators are less stable and produce more jitter
than the commercial function generators, resulting in sparser histograms.
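
The reasoning behind this check can be expressed as a simple interval
classifier: a missed pulse merges two periods into one interval of about
twice the baseline, while a spurious pulse splits a period into shorter
intervals. The Python sketch below illustrates the idea and is not the
analysis script used for Figure 9. The function name classify_intervals
and the 25% tolerance are assumptions chosen for the example.

```python
def classify_intervals(timestamps, period, tolerance=0.25):
    """Count inter-pulse intervals that look normal, missed or spurious.

    timestamps: increasing pulse times from one channel (any consistent unit)
    period:     nominal period of the signal source, in the same unit
    tolerance:  assumed relative deviation still accepted as a normal interval
    """
    counts = {"ok": 0, "missed": 0, "spurious": 0}
    for earlier, later in zip(timestamps, timestamps[1:]):
        dt = later - earlier
        if dt > period * (1 + tolerance):
            counts["missed"] += 1     # too long: at least one pulse was lost
        elif dt < period * (1 - tolerance):
            counts["spurious"] += 1   # too short: an extra event was recorded
        else:
            counts["ok"] += 1
    return counts
```

For a clean recording every interval falls in the "ok" bin, which
corresponds to a histogram concentrated around the baseline period.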

We emphasize that the input event periods employed during this test were
well below the minimum interval between spikes (refractory period)
attainable by a typical neuron. For example, in Chrysomya megacephala's H1
neuron this minimum interval is 2 ms [20].


4.3 Arbiter resilience evaluation 

Besides the dynamic arbiter design advantages related to code reusability,
inherent to its latency insensitive and decoupled characteristics, it is
also more resilient to failures. In order to demonstrate this, we needed
to increase the timestamp resolution above its original specification of
1 μs. In fact, any update rate below 1/40 of the clock frequency (50 MHz)
can always be correctly scheduled, as it leaves room for at least 5 rounds
of 4 memory operations, each one taking 2 cycles (see Figure 7),
sufficient to carry the 5-byte (channelFlags, timestamp) tuple in and out
of the FIFOs. Thus, to be able to observe failures related to differences
in SRAM arbitration schemes, we increased the timestamp counter update
rate from 1/50 to 1/20 of the clock frequency.
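
The scheduling bound used in this argument is plain arithmetic, spelled
out below as a short Python sketch. The constants come from the text; the
function names are hypothetical, and treating a divider of exactly 40 as
schedulable is an assumption of the example.

```python
CLOCK_HZ = 50_000_000       # system clock frequency
CYCLES_PER_MEMORY_OP = 2    # each SRAM operation takes 2 cycles
OPS_PER_ROUND = 4           # one enqueue and one dequeue on each of the two FIFOs
RECORD_BYTES = 5            # (channelFlags, timestamp) tuple


def worst_case_cycles_per_record():
    """Cycles needed to move one record through the FIFOs, one byte per round."""
    return RECORD_BYTES * OPS_PER_ROUND * CYCLES_PER_MEMORY_OP


def tick_period_ns(divider):
    """Timestamp resolution, in nanoseconds, for an update rate of clock/divider."""
    return divider * 1_000_000_000 // CLOCK_HZ


def always_schedulable(divider):
    """True if a record can always be moved before the next timestamp tick."""
    return divider >= worst_case_cycles_per_record()
```

With these numbers, a divider of 50 gives the original 1 μs resolution and
stays within the 40-cycle bound, whereas a divider of 20 gives 400 ns and
no longer guarantees that every record can be scheduled.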

In order to keep control over parameters such as UART transmission rate, we 
simulated the system instead of evaluating it with workbench instruments. Both 
architectures were simulated under exactly the same parameters and inputs. 
Inputs were fed with trains of pulses generated according to a Poisson process, a 
stochastic model that occasionally produces activity bursts, although it possesses 
a parameterized mean rate. It has also been adopted in some statistical models 
of spiking neurons [1]. The first channel, however, was modeled as an oscillation 
whose frequency varies according to a narrow Gaussian distribution, which better 
reproduces the behavior of the stimuli reference clock. 
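
Input trains of this kind are straightforward to generate. The Python
sketch below is a minimal illustration and not the simulation testbench
itself. It assumes pulses are quantized to clock cycles with at most one
pulse per cycle, and it models the reference channel by adding Gaussian
jitter to the period rather than to the frequency. The function names and
the seed parameter are hypothetical.

```python
import random


def poisson_pulse_train(rate_per_cycle, n_cycles, seed=0):
    """Return the cycle indices of pulses drawn from a Poisson process."""
    rng = random.Random(seed)
    t = 0.0
    pulses = []
    while True:
        # Exponential waiting times between events define a Poisson process
        t += rng.expovariate(rate_per_cycle)
        cycle = int(t)
        if cycle >= n_cycles:
            return pulses
        if not pulses or pulses[-1] != cycle:  # at most one pulse per cycle
            pulses.append(cycle)


def jittered_clock(mean_period, sigma, n_cycles, seed=0):
    """Return edge cycles of an oscillation whose period has Gaussian jitter."""
    rng = random.Random(seed)
    t = 0.0
    edges = []
    while True:
        t += max(1.0, rng.gauss(mean_period, sigma))
        if t >= n_cycles:
            return edges
        edges.append(int(t))
```

Because waiting times are exponentially distributed, short runs of closely
spaced pulses appear from time to time even though the mean rate is fixed,
which is the bursty behavior mentioned above.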

We limited UART transmission to 1 byte per 6 cycles, aiming to observe the 
system in a regime in which it would eventually acquire more data than it could 
transmit. UART reception was constrained to 1 byte per 10 cycles, allowing 
3-byte chunks of stimuli data to be provided to the system at double the speed 
required by the first-channel reference clock, whose mean frequency was chosen 
at 1/60 of the system clock. 
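
These limits can be cross-checked with a few lines of arithmetic. The sketch 
below uses only figures quoted in this section; the reference-clock period of 
60 cycles follows from the "double the speed" statement, since 3 bytes at 10 
cycles each take 30 cycles per chunk. Constant and function names are 
illustrative. 

```python
from fractions import Fraction

TUPLE_BYTES = 5           # (channelFlags, timestamp) tuple size
STIMULUS_BYTES = 3        # size of one chunk of stimuli data
TX_CYCLES_PER_BYTE = 6    # simulated UART transmission limit
RX_CYCLES_PER_BYTE = 10   # simulated UART reception limit
TIMESTAMP_PERIOD = 20     # cycles between timestamp counter updates
REF_CLOCK_PERIOD = 60     # mean period of the first-channel reference clock


def tx_cycles_per_tuple():
    """Cycles the UART needs to send one complete tuple."""
    return TUPLE_BYTES * TX_CYCLES_PER_BYTE


def rx_cycles_per_chunk():
    """Cycles the UART needs to receive one chunk of stimuli data."""
    return STIMULUS_BYTES * RX_CYCLES_PER_BYTE


def rx_headroom():
    """How many times faster stimuli arrive than the reference clock needs."""
    return Fraction(REF_CLOCK_PERIOD, rx_cycles_per_chunk())


if __name__ == "__main__":
    print(tx_cycles_per_tuple(), rx_cycles_per_chunk(), rx_headroom())
```

Sending one tuple takes 30 cycles, while a new tuple can be produced every 20 
cycles, so sustained activity at the highest rate necessarily outpaces the 
transmitter. 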

Varying the rate parameter of the Poisson processes, we measured the total 
mean input event rate to which the system was exposed, i.e. the number of 
input events divided by the number of cycles of simulation, relating it to the 
mean time between failures (MTBF) in number of cycles. Failures were detected 
if any of the four error conditions discussed in Section were triggered. In this 
experiment they happened only due to overflows in the tx path. 

Fig. 10. Mean time between failures (MTBF) obtained by simulation, with designs 
configured for an increased timestamp resolution of 1/20 of the system clock, and subject 
to large mean input event rates. The dynamic arbiter is consistently more resilient, 
and presented two operating regimes: in regime A, failures are caused by overflow of 
uartOutFifo, while in regime B, memory write requests occur at a high rate and 
eventually cannot be scheduled on time, overflowing funnel. As the static arbiter 
schedules enqueuing operations at a fixed rate of 1 byte per 8 cycles (slower than 
the simulated UART transmission rate), regime C is only due to funnel overflow. 
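
The two quantities related in this experiment can be computed from simulation 
counters as sketched below. This is an assumed post-processing step, not code 
from the actual testbench: the names are hypothetical, and the MTBF is taken 
here simply as the simulated cycles divided by the number of detected failures. 

```python
def mean_input_event_rate(n_events, n_cycles):
    """Total input events divided by the number of simulated cycles."""
    return n_events / n_cycles


def mtbf_cycles(n_failures, n_cycles):
    """Mean time between failures, in cycles.

    Returns infinity when no failure was observed during the simulation.
    """
    if n_failures == 0:
        return float("inf")
    return n_cycles / n_failures


if __name__ == "__main__":
    # Hypothetical counter values, for illustration only.
    print(mean_input_event_rate(n_events=45_000, n_cycles=1_000_000))
    print(mtbf_cycles(n_failures=8, n_cycles=1_000_000))
```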

Figure 10 shows these results and demonstrates that, besides supporting a 
higher mean firing rate than the static arbiter without missing events, the 
dynamic arbiter takes longer to fail when the frequency approaches its limits. 
At an input rate of 1/20, the maximum meaningful frequency at the adopted 
timestamp resolution, the dynamic arbiter fails after circa 10^ cycles, whereas 
the static arbiter withstands only 10^ cycles. 

Under the simulation parameters, the static arbiter is not able to fill 
uartOutFifo’s SRAM-contained ring with more than 1 byte. This happens because 
the FIFO enqueuing rate is limited to a maximum of 1 byte per 8 cycles by the 
central arbiter schedule (see Figure), while the simulated UART can reach a 
transmission rate of 1 byte per 6 cycles, sufficient to keep the FIFO almost empty. 
The observed failures were due to overflow of the funnel: the uartOutFifo never 
even came close to a full state. Therefore, the failure process may be viewed as 
nearly stationary at a time scale greater than 10^ cycles. Indeed, the measured 
data set (Figure 10-C) does not change significantly if we reset the circuit state 
a couple of times during the course of simulation. 
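
The occupancy argument can be illustrated with a toy cycle-level model. It 
abstracts away the SRAM and the real arbiter: it assumes the worst case in which 
a byte is ready at every enqueue slot, and that a byte enqueued in a given cycle 
may start transmission in that same cycle. All names are hypothetical. 

```python
def peak_fifo_occupancy(enqueue_period, tx_cycles_per_byte, n_cycles):
    """Peak number of bytes held in a FIFO fed at one byte every
    `enqueue_period` cycles and drained by a transmitter that needs
    `tx_cycles_per_byte` cycles for each byte."""
    occupancy = 0
    peak = 0
    tx_free_at = 0  # first cycle at which the transmitter is idle again
    for cycle in range(n_cycles):
        if cycle % enqueue_period == 0:
            occupancy += 1
            peak = max(peak, occupancy)
        if occupancy > 0 and cycle >= tx_free_at:
            occupancy -= 1
            tx_free_at = cycle + tx_cycles_per_byte
    return peak


if __name__ == "__main__":
    # Static arbiter schedule: enqueue every 8 cycles, transmit in 6 cycles.
    print(peak_fifo_occupancy(8, 6, 10_000))
    # Reversed rates, for contrast: the FIFO keeps growing.
    print(peak_fifo_occupancy(6, 8, 10_000))
```

Under these assumptions the transmitter is always idle by the next enqueue 
slot, so the ring never holds more than a single byte. 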

On the other hand, the dynamic arbiter is able to surpass an operating regime 
(Figure 10-B) where failures occur because of funnel overflow, reaching another 
regime at lower frequencies (Figure 10-A) where the system does not abort until 
uartOutFifo is full. The dashed curve is proportional to (f - f_lim)^(-1), where f 
is the mean input frequency (abscissa) and f_lim is a limit frequency (in this 
experiment, f_lim ≈ 0.0282) such that 5 f_lim is less than the effective UART 
transmission rate, implying the circuit is not expected to ever fail for f < f_lim. 
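
The role of the limit frequency can be made concrete with a first-order fluid 
approximation, sketched below. This is our own simplification and not the 
procedure used to obtain the dashed curve: it assumes every input event yields 
one 5-byte tuple, that bytes drain at exactly 5 f_lim per cycle, and that the 
FIFO has a hypothetical capacity `capacity_bytes`. 

```python
TUPLE_BYTES = 5
F_LIM = 0.0282            # limit frequency quoted in the text (events/cycle)
NOMINAL_TX_RATE = 1 / 6   # simulated UART limit (bytes/cycle)


def drain_rate_at_limit():
    """Byte rate generated at f = f_lim, i.e. 5 * f_lim (bytes/cycle)."""
    return TUPLE_BYTES * F_LIM


def fluid_time_to_overflow(f, capacity_bytes):
    """Cycles needed to fill the FIFO at mean input frequency `f`.

    The net inflow is 5 * (f - f_lim) bytes per cycle, so the fill time is
    inversely proportional to (f - f_lim); below f_lim the FIFO never fills.
    """
    net_inflow = TUPLE_BYTES * (f - F_LIM)
    if net_inflow <= 0:
        return float("inf")
    return capacity_bytes / net_inflow


if __name__ == "__main__":
    print(drain_rate_at_limit() < NOMINAL_TX_RATE)
    print(fluid_time_to_overflow(0.0382, capacity_bytes=1000))
```

Note that 5 f_lim = 0.141 bytes per cycle is indeed below the nominal limit of 
1 byte per 6 cycles (about 0.167 bytes per cycle). 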

5 Conclusions 

The system described in this paper (source code at m) can be applied to 
neuroscience research in both in vivo and in vitro experiments requiring 
deterministic timing and synchronous stimuli generation, such as the study of 
neural coding in the visual system of flies m. It can also be applied to 
experiments in neuroethology, for example in the analysis of electrocommunication 
signals produced by pulse-type electric fish [22,23,24]. The employed digital 
pulse timestamping technique achieves a measurement precision on the order of 
1 µs, much finer than that of most ADC-based acquisition systems. Even though 
our project was programmed into a small reconfigurable device, almost half of 
the LEs were left free and can be used to implement future experiments with 
real-time feedback [26]. 

We have also shown that Bluespec SystemVerilog (BSV) can be effectively 
used even in projects involving small devices, and have presented an approach 
to refactor a decoupled and latency insensitive logic into a statically arbitrated 
one, which could be useful when a designer needs to quickly lower a system’s LE 
usage. However, keeping dynamic arbitration can be worth its LE cost if 
the system needs to be resilient to activity bursts. The designed code is modular 
and can be reused to implement similar systems; e.g., we have a working prototype 
for closed-loop experiments implemented on an EP4CGX150DF31C7 FPGA, which 
occupies 5728 LEs (4% of the device) and interfaces with a real-time operating 
system (RTOS) through PCI Express [27]. 